perm filename FINRAD.DMA[DMA,HE] blob sn#660656 filedate 1982-10-12 generic text, type C, neo UTF8
CONSTRAINTS
-----------

Epipolar Geometry
-----------------

Constraints on the correspondence process are either photometric or
geometric in nature.

One of the primary geometric constraints used in stereo analysis is that
afforded by knowledge of the imaging configuration.  It is referred to in
the photogrammetry literature as `epipolar geometry'.  When the imaging
relative orientations are known, the search for correspondences between
left and right images can be restricted to a linear path.  To determine
the corresponding linear paths in a pair of images, the viewing geometry
must be analyzed. First, consider a restricted camera geometry, referred
to as the {\it `collinear'} camera model: here, the origin is located at
the focus of the left camera and the right camera focus lies on the $x$
axis.  The two image planes are coplanar and are perpendicular to the $z$
axis.  The baseline, $B$, is the distance between foci.  Given any point
on an object, we define an ``epipolar plane'' as that plane passing
through the object point and both foci.  This plane intersects the two
image planes, defining an ``epipolar line'' in each.  These lines are
parallel to the $x$ axis in our {\it collinear} camera model, and we will
refer to them as ``epipolars''.  Corresponding points must lie on
corresponding epipolars, that is lines with the same $y$-coordinate in
both left and right images.
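
For the {\it collinear} model, then, correspondence is a one-dimensional
search along the shared scanline, and depth follows from disparity by
similar triangles.  A minimal sketch (the function name and units below
are illustrative only, not drawn from any of the systems reviewed):

```python
def depth_from_disparity(f, B, xl, xr):
    """Collinear camera model: corresponding points share a y-coordinate
    (the epipolar constraint), so matching reduces to a 1-D search in x.
    Depth then follows from similar triangles:  z = f * B / (xl - xr),
    where xl - xr is the disparity (positive for points at finite depth)."""
    d = xl - xr
    if d <= 0:
        raise ValueError("non-positive disparity: point at or beyond infinity")
    return f * B / d

# A point imaged at x = 52.0 (left) and x = 50.0 (right), with focal
# length 35.0 and baseline 0.5 in consistent units:
z = depth_from_disparity(35.0, 0.5, 52.0, 50.0)  # 35.0 * 0.5 / 2.0 = 8.75
```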

If a stereo image pair were taken with a different (non-collinear)
geometry, a similar mapping could be obtained for corresponding
(conjugate) epipolars, although they would be neither parallel to each
other nor parallel to the scanning axis ($x$ axis) of the cameras.
[\hallert, page 27] discusses general epipolar geometry.

Two further geometric constraints used in limiting correspondence search
are presented by Marr and Poggio [\marrpoggiob]; these are {\it
uniqueness} and {\it smoothness}.  Arnold and Binford [\arnbinf] present
geometrically-based estimation criteria for selecting corresponding image
features ({\it edges}).  The Control Data Corporation mapping effort
introduces a set of geometric constraints specific to the domain of
imagery they wish to interpret:  rectilinear cultural objects.  Mayhew and
Frisby [\mayhewfrisbyaij] improve upon Marr and Poggio's specification of
a stereopsis process with the introduction of figural continuity
(geometric) constraints and the use of cross-channel correspondences based
upon `primal sketch' type descriptive primitives ([\marrearly]). They also
incorporate improved use of photometric information in their matching.
[Ryan 79] discusses the inherent inadequacies of photometric
cross-correlation matching.  These geometric and photometric constraints
are summarized in the following descriptions.

Marr and Poggio
---------------

Marr and Poggio's image interpretation constraints are derived from two
observations on the nature of surfaces in space:
     a) ``a given point on a physical surface has a unique position
	  in space at any one time.''
     b) ``matter is cohesive and separated into objects. Furthermore, the
	  surfaces of objects are generally such that changes in the surface are
	  very small compared to their distances from the viewer.''

From the first observation they extract the constraint that `each item from each
image may be assigned at most one disparity value'. Unfortunately, this
contradicts the usual interpretation of Panum's Limiting Case, where
it is known that one may fuse two elements from one eye with a single element
in the other eye (given limits on the separation of the two elements).
Their second constraint deals with limiting disparities in an area by presuming
that proximal image points should be proximal 3-space points.
With their first constraint, they can limit image point matches to 1-1 mappings;
the second allows them to use area disparity statistics to remove bad matches.

Their matching is based upon two properties of `edges' in the images
(edges are zero-crossings in a band-pass difference of gaussian
convolution).  These properties are zero-crossing {\it contrast} (positive
or negative) and very rough edge orientation estimates (quantized to 30
degrees, so slopes must be within approximately 60 degrees of each other).
Matches are of {\it positive}, {\it negative}, and {\it zero} disparity,
relative to the vergence.  These two measures used for matching
in the projective images are given no analytic derivation, and are
strictly ad hoc.
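
These two measures can be sketched as a single compatibility predicate
(the predicate below is one plausible reading of their test, not their
implementation; orientations are in degrees and treated as unoriented):

```python
def mp_compatible(contrast_l, contrast_r, orient_l, orient_r, max_diff=60.0):
    """Marr-Poggio style match test: zero-crossings may pair only if
    their contrasts have the same sign and their rough orientation
    estimates differ by less than about 60 degrees.  The difference is
    taken modulo 180 since an edge direction has no preferred sense."""
    if (contrast_l > 0) != (contrast_r > 0):
        return False
    diff = abs(orient_l - orient_r) % 180.0
    diff = min(diff, 180.0 - diff)
    return diff < max_diff

# Same-sign contrasts at 20 and 70 degrees differ by 50 degrees: allowed.
ok = mp_compatible(1.0, 0.8, 20.0, 70.0)    # True
# Opposite contrast signs are never matched.
bad = mp_compatible(1.0, -0.8, 20.0, 25.0)  # False
```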

Arnold and Binford
------------------

Arnold and Binford describe the derivation of two important analytic
functions based on geometric constraints for the matching of edges along
conjugate epipolar lines.  One concerns a constraint on the {\it
intervals} between adjacent edges and the other concerns a constraint on
the {\it orientation} of matched edges. The basis of the first constraint is
that a surface seen from one imaging site should, in general, have similar
dimensions to those it would have from another imaging site; in particular,
the projective widths should be roughly the same over most small to medium
baseline viewings.  The second constraint makes a similar statement about the
image plane orientation of scene edges.

Their results allow a
distribution function in the three-dimensional object space to be
translated to a distribution function in the two-dimensional image space.
The image space functions allow probability estimates to be made on the
likelihood that edges from the two images correspond. These estimates can
be used as quantitative measures for the selection of the best pairing of
edges among a set of alternatives.

The analysis deals with the case of edge features on corresponding
epipolar lines in the two images. The feature parameters of interest are
the position and orientation of edges, that is, the points at which image
edges intersect the epipolar, and the orientation of the edges at those
points.  The authors invoke the general assumption that edge and surface
orientations are not related to observer position.

Their edge orientation analysis is based on a mapping of projective edge
orientations in the image planes to a sphere of uniformly distributed
surface normals in three space.  The mapping of surface normals to image
edge orientations at various imaging baselines defines the distribution
function.
A similar mapping of surface orientations in three-space to their
projected intervals in two-space leads to similar constraints on the
distribution of interval correspondences.
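
The flavor of this mapping can be illustrated numerically: sampling
3-space directions uniformly on the sphere and projecting them into two
views separated by a vergence angle shows the projected orientation
differences clustering far more tightly for small angles.  The sketch
below is a Monte Carlo illustration under an assumed orthographic
projection, not their analytic derivation:

```python
import math, random

def orientation_difference_samples(vergence_deg, n=20000, seed=0):
    """Sample 3-D edge directions uniformly on the unit sphere, project
    each into two cameras looking down -z and rotated about the y axis
    by +/- half the vergence angle, and return the absolute differences
    of the projected image-plane orientations (degrees, mod 180)."""
    rng = random.Random(seed)
    half = math.radians(vergence_deg / 2.0)
    diffs = []
    for _ in range(n):
        # uniform random direction on the sphere
        z = rng.uniform(-1.0, 1.0)
        t = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(max(0.0, 1.0 - z * z))
        v = (r * math.cos(t), r * math.sin(t), z)
        angles = []
        for s in (-half, half):
            # rotate about y, then take the orthographic image-plane orientation
            x = v[0] * math.cos(s) + v[2] * math.sin(s)
            angles.append(math.degrees(math.atan2(v[1], x)) % 180.0)
        d = abs(angles[0] - angles[1])
        diffs.append(min(d, 180.0 - d))
    return diffs

small = orientation_difference_samples(6.0)
large = orientation_difference_samples(60.0)
# Small vergence angles concentrate the differences near zero, so
# orientation is a much sharper matching constraint for narrow baselines.
```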

CDC
---
The aim of this CDC work was to provide automatic reference preparation
capabilities for terminal homing; the references being structural models
of buildings which may then, at a later point, be used in scene
recognition for autonomous guidance.  Because of this aim, they addressed
their work specifically toward the problem of constructing planar models
of rectilinear cultural scenes from aerial imagery.  Constraints have been
built into the system to make it only applicable to planar surfaced
structures, and the search technique (dynamic programming) only accepts
transitions indicative of nearly horizontal or vertical surfaces $\ldots$ in
fact, they go to substantial effort to ignore surface detail (such as
roads, sidewalks, windows).

In a subsequent report, the group comments on the inadequacies of this
restrictive approach to stereo mapping.  They call it `blind' to the
surrounding context of the cultural scene, and argue that what is needed
are `techniques that are more akin to human picture processing where image
elements are symbolically related and where one exploits {\it a priori}
knowledge of how cultural structures are constructed.'  To meet this need
for geometric knowledge in the processing, they designed a `structural
syntax'.  The structural syntax was introduced as a set of geometric
principles specific to the sort of 3-d cultural scene their research
addressed. The syntax allowed them to match scene structures at a symbolic
level. Specifically, symbolic matching was applied only to roofs, and
these were restricted to horizontal and rectangular, L-shaped, or T-shaped
figures.

Three geometric principles were introduced as `structural syntax':

    1) The edge orientation principle:  Use of convergence of 3-space parallel
    lines to vanishing points, for clustering parallel (opposing) edges
    (utilization of vanishing points has also been suggested in [\liebes]).

    2) The principle of known transform slope:  Governs allowable 3-space
    orientations of edges, constraining surfaces to be either vertical or
    horizontal.

    3) The min-max transform principle:  Governs the (pre-determined) range of
    acceptable heights for structures.


Testing was done on real imagery from a small portion of one medium-sized
building. Edges selected were quite conspicuous, and were oriented in only
one direction.  (Images were {\it collinearized} to have epipolar lines
parallel before stereo processing was begun.)  In their implementation,
the assumption was made that the orientation of buildings was given, so
the syntax could be applied at a very early level.  As a measure of the
robustness of their approach, it should be noted that a substantial amount
of operator intervention was required in the processing.

In summary, they use constraints to limit the orientations of buildings,
the slopes of scene surfaces (strictly horizontal or vertical), and the
height of structures.  Only roofs (or roof-like regions) are processed,
and among these, only those which have particular shape (T's, L's, or
rectangles).  Their constraints aren't based on analysis, but on
observations from a few highly restricted stereotypical cases.


Mayhew and Frisby
-----------------

[\mayhewfrisbyaij] discuss a significant variant of the theory of
stereopsis posed by Marr and Poggio [\marrpoggiob].  The authors show that
human vision utilizes more than simply zero-crossings in obtaining local
stereopsis; rather, they suggest that peaks in the signal (difference of
gaussians) also be used in the correspondence, and cite compelling
psychophysical evidence in support of this.  They are saying that
luminance variation, and not just luminance discontinuity, plays a role in
stereopsis. As with the Marr and Poggio approach, Mayhew and Frisby limit
correspondences to zero-crossings of same sign, and of roughly similar orientation,
and again this is done without adequate statistical analysis.

The authors feel that the goal of stereo processing is the
construction of a `Binocular Raw Primal Sketch (BRPS)', and that this is
attained through the interaction of locally constrained matching and globally
constrained disambiguation.  The global disambiguation is effected through the
use of
{\it figural continuity} and {\it cross channel correspondences}. The figural
continuity constraint capitalizes on the connectedness of zero-crossings in the two
projective images, and with the restriction that connected edges have
connected correspondences, implicitly maintains the Marr and Poggio
constraint that zero-crossings be smooth in three-space (although their definition
of ``connectedness'' is not precise).
The correspondence algorithm they mention allows a feature in one of the
images to match {\it several} features in the other image.  This is in
sharp contrast to the matching algorithm of Marr and Poggio, where edge
correspondences are at most one-to-one. This enables stereopsis of Panum's
limiting case (but brings an untold increase in complexity to the matching
process).



Ryan and Hunt
-------------

[\ryana] investigates sources of digital correlation error and develops
image quality measures which can be used to predict the magnitude of
correlation errors in a particular region of an image.  They do this
with the long-term aim of providing a pre-screening capability that would
allow defective imagery to be identified and routed to manual processing
or some form of image enhancement.  The two primary sources of
difference between images are presented as resulting from {\it film or
sensor noise} (a weakly signal-dependent random quantity injected into
both images), and {\it relief distortion} (the effect of viewing
orientation and perspective on the relative attitude of scene surfaces). A
third source of image difference is mentioned as being that arising from
the sensing averaging process, where discretization of the original scene
intensity map over a fixed window leads to conjugate image points having
differing intensities.  Simulations of the image formation and correlation
processes (with two different correlation measures: the Cramer-Rao bound and
MSE) suggest that automated correlation may be slightly improved by
low-pass filtering the images, and they reference a paper by Helava which
says that there exists a relatively narrow band of spatial frequencies at
which such correlators will function best.  At the same time, smooth images have
high error attributable to correlator `self noise', particularly where the
images have high signal-to-noise ratios and smooth correlation functions.
The determination of correlation failure is dependent upon the relief
distortion estimated for the scene.  A feature-based simulation was also
carried out, using as features over the window:  the Cramer-Rao measure,
brightness variance, contrast difference, contrast ratio, and brightness
median absolute difference (MAD).
Individual analyses of the various features indicated that MAD is
generally best, with contrast ratio worst.  Increasing window size
improves feature performance.  A final simulation feature test was the
calculation of a multiple linear regression fit to the data (with an 8 by
10 averaging window and 11 by 3 feature window).  The residual sum-squared
error was then calculated for the various
feature combinations.  The results indicate that variance with the
normalized Cramer-Rao measure provides the best fit.  The Cramer-Rao
measure is a lower bound on the variance of any unbiased estimate of the
parallax in an image-pair region, although its use presupposes an estimate
of the noise spectral density (which can be obtained by analyzing a
structure-free part of the image).  The experiments here suggest that
simple contrast measures are just as good at estimating correlation errors
as are theoretically more sensitive measures, such as the Cramer-Rao bound.
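
The simpler window features can be computed directly; the definitions of
MAD and contrast ratio below are plausible readings of the text, not
necessarily the authors' exact formulations:

```python
def window_features(left, right):
    """Simple predictors of correlation quality over a pair of
    corresponding image windows (given as flat intensity lists).
    MAD: median absolute difference of corresponding pixels.
    Variance: brightness variance of the left window.
    Contrast ratio: ratio of the windows' intensity ranges."""
    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2.0
    mad = median([abs(a - b) for a, b in zip(left, right)])
    mean = sum(left) / len(left)
    variance = sum((a - mean) ** 2 for a in left) / len(left)
    span_l = max(left) - min(left)
    span_r = max(right) - min(right)
    contrast_ratio = span_l / span_r if span_r else float('inf')
    return {"mad": mad, "variance": variance, "contrast_ratio": contrast_ratio}

feats = window_features([10, 12, 14, 40], [11, 12, 15, 38])
```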

EVALUATION FUNCTIONS
--------------------

Marr and Poggio
---------------

The evaluation functions used in the stereo mapping system described
by Marr and Poggio apply to the selection of corresponding edges in the
two images and to the acceptance of edge matches in some local area.
None of the functions is based on deep analysis.  Edges are allowed
to match if their contrasts are the same and their image plane orientations
differ by less than 60 degrees (see Arnold and Binford for a superior analysis
of this parameter).  Local groups of edge matches are accepted if they have
similar disparity (with `local' defined by some sampling significance measure).
There is nothing of quantitative value in their evaluation functions.

Arnold and Binford
------------------

Arnold and Binford's probabilistic estimates provide a quantitative basis
for using edge orientation and position in selecting the best matches
among a set of edges.

Their analysis suggests that the two functions which relate edge
orientations and interval widths across images are sharply peaked even for
the 60 degree vergence angles used in aerial photography.  When angles
corresponding to human vision are used, the conditions are extremely
strong. This leads them to recommend the use of mapping sequences with
small angle stereo.  For example, instead of a pair of images with a 60
degree baseline, a sequence of 10 images at 6 degrees would provide the
same overall baseline.  Tight constraints would simplify the task for the
program, which could track features from frame to frame, then make depth
estimates based on the accumulated baseline (see [\nevatiadepth]).

Their results lead them to conclude that in biological vision:  a) stereo
cells should show a half width of about 9 degrees for angle differences in
the two eyes; and b) such stereo cells will be insensitive to angles of
vectors in space.  However, those angles can be calculated accurately at a
later stage from associated vectors.

It is interesting to note that the underlying distribution assumptions
that lead to the results they cite may be modified by known distributions,
where they are available.  To first order, the assumption of uniform
distributions in the object space is useful.  However, the functions can
be made to incorporate knowledge of the scene when it is available.
Cultural scenes, for obvious structural reasons, tend to be strongly
oriented with respect to gravity.  Here, horizontal and vertical surfaces
predominate, and one would expect a strong bimodal surface distribution,
and a similar bias in the distribution of their projective interval
correspondences.

Baker
-----
Baker's system matches both edges and intensities, and uses statistical
measures for both in posing evaluation metrics.  At the edge level, his
system uses orientation in the image plane, intensity values to the sides,
interval widths, and (under certain conditions) contrast.  The uses of
orientation and interval width are based on analyses of Arnold and Binford
([\arnbinf]). The quantitative comparison of intensity values is attained
by modelling the intensity variation as a zero-mean gaussian process with
standard deviation specified by the estimated noise of the images; the
likelihood of two intensity values representing the same surface element
is approximated by the likelihood that image noise caused their intensity
difference. This uses the assumption of surface matte reflectivity. Noise
is estimated from the statistics of the image first difference.  Better
estimates of image noise than this may be obtained.
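
This intensity comparison admits a compact sketch (assuming additive,
independent, zero-mean noise; the function names and the Gaussian
normalization are illustrative, not Baker's code):

```python
import math

def estimate_noise_sigma(row):
    """Estimate image noise from first-difference statistics: for
    independent additive noise, Var(I[i+1] - I[i]) is about 2*sigma^2."""
    diffs = [b - a for a, b in zip(row, row[1:])]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(var / 2.0)

def match_likelihood(i_left, i_right, sigma):
    """Likelihood that two intensities image the same matte surface
    element, modelling their difference as zero-mean Gaussian noise."""
    d = i_left - i_right
    return math.exp(-d * d / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

sigma = estimate_noise_sigma([100, 102, 99, 101, 100, 98, 101])
# Identical intensities score higher than ones several sigma apart.
```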

CDC
---
The correspondence metric used in the first Control Data Corporation
system is pixel intensity difference. Edges are found in the two images by
a simple (manually controlled) edge operator.  These edges are used to
bound the intervals of intensities that are to be matched. Correspondence
between edges is a side effect of the intensity correlation; edges
themselves are not compared. The search procedure operates over these
intensity intervals using a least squares minimization on the intensity
differences to choose the optimal interval correspondences.

Their approach here, however, is deficient for use in an automated system,
as it requires extensive operator intervention: in the first case to
specify initial edge correspondences, and in the second case to correct
the propagating interpretations when they err.  Clearly, though, some
measure on the intensity function is essential, as therein lies all scene
information.


SEARCH PROCESS
--------------

CDC
---
The search process used by the Control Data Corporation in their `Broken
Segment Matcher', a dynamic programming minimization of intensity differences,
was tailored specifically for the cultural domain they were
addressing in their research.  They took an interesting edge and
area-based approach to the problem, using edge information to guide the
application of a dynamic programming intensity matching process.  Roughly,
their algorithm functions as follows:

	\vbox{
	\vspsml\bullit{Geometrically transform a pair of images, bringing
	them into a {\it collinear} epipolar geometry (making corresponding lines
	parallel).}
	\vspsml \bullit{Locate (via a Sobel
	operator) and `thin' edges in the two images.}
	\vspsml \bullit{Establish
	edge correspondences in the first pair of epipolar lines -- by hand.} }
	\vbox{
	\vspsml	\bullit{Maintain two cooperating correspondence processes
	to minimize the
	effects of image noise and extraneous detail.  The first process matches
	intensities using only edges deemed to be `reliable', such as those seeded
	to the system
	through the manual startup.  The second process considers {\it all} edges,
	and, using the correspondences found by the first process for the
	particular line correlation it is presently performing, suggests a larger
	set of correspondences.  Those correspondences which are seen to `persist'
	over several preceding second process line analyses (suggesting that they
	arise from true scene geometric discontinuities) are given for
	consideration to the first process for its {\it next} line analysis.}  }


The algorithm progresses from one image epipolar line to another,
propagating results (to limit subsequent search) as it goes.  The
algorithm requires manual starting.  It propagates determined
correspondences along paths of proximal edges as it progresses from line
to line. As stated, correspondence is determined by a least-squares
minimization technique on interval intensity differences.  The algorithm
preprocesses the imagery data in a way that precludes it from working with
anything other than straight lines (as derived from sequences of edges) in
the images.  One point to note is that the search strategy they use --
having two correspondence processes, with the second introducing `new' and
removing `old' scene structures from the analysis -- introduces a hysteresis
into the processing: new structures (in the direction of processing)
take a while to be believed (`persist'), while old structures take a while
to disappear once passed.

In summary, their search for a 3-space interpretation of the imagery
involves obtaining first a hand-input initialization, determining a
least-squares mapping of intervals bounded by edges that `persist', and
using this mapping to limit the search for all other edges (bringing in
those that may be spurious).  Their approach, then, is one of hand-seeding
a coarse-to-fine search process that evaluates matches on the basis of
sums of squares of intensity difference.  The search is restricted by the
constraints of interpretation as strictly vertical or horizontal surfaces.

In their follow-on work, they developed a `syntactic' approach involving
the matching of image regions.  Two complementary algorithms
were developed for region matching.  Both centered on roof identification,
under the assumption that roofs were less likely to be occluded or contain
shadows.  Roofs were assumed parallel to the ground. Walls were to be
`dropped' to intersect the ground after roofs have been identified.  Both
systems use edge files to construct roof regions.  One of the two
algorithms `constructs regions individually and separately in each image',
and then performs the matching process between the images.  The other
first matches edges in the two images, using 3-space height as a
constraint, and then constructs regions from stereo pairs of lines.

	ALGORITHM 1: FIGURE MATCHING - Polygonal figures are found in the two
	images.  Symbolic stereo matching is guided by syntax directed corner
	matching.  Syntactic principles guide completion of incomplete line
	figures.

Tests were performed on manually produced line files, using real
camera model transformations.  The files corresponded to horizontal
surfaces.  Matching proceeded satisfactorily, and incomplete figures
seemed to be correctly completed.

	ALGORITHM 2: LINE SEGMENT MATCHING - This procedure involves pair-wise
	matching of lines from stereo images. Consistency metrics are introduced
	to measure the validity of matches. Application has been made to
	rectangular and tee-shaped roofs.  After rectangles and tees have been
	formed, they are compared, and `line-sharing-conflict' criteria are
	applied to choose a final set of building tops.


The authors acknowledge that `extensive testing and technique development
is still necessary within all areas.'  The present capability appears to
require a substantial degree of skilled operator intervention involving
iterative tuning and repeated passes through the process.

The constraints they have introduced here --- the scene is imaged
orthographically, the structures are strictly rectilinear, all vertical
surfaces are either parallel or orthogonal, and all horizontal surfaces
are parallel --- are severely restrictive.  In the context of their goals,
the constraints have been valid, and perhaps will even be useful for later
researchers.  Their contribution comes in identifying constraints relevant
to the specialized domain of rectilinear structures in cultural scenes.
The important question of identifying a scene as cultural in order to
allow this constrained interpretation is not addressed.


Baker
-----
Baker breaks his search for the correct interpretation of a scene viewed
projectively from two points into several parts.  It is an edge-based
line-by-line stereo correspondence scheme which utilizes image intensities
in determining a full depth map of the viewed scene. The phases of its
analysis are:
\vbox{
\vspsml \bullit{extracting edge
	descriptions of a pair of images (with a series of simple second difference
	zero-crossing edge operators), at several levels of image resolution}
\vspsml \bullit{linking these edges to their nearest
	neighbors to obtain the connectivity structure of the images,}
\vspsml \bullit{matching the edge descriptions up through the various resolution
	levels on the basis of local edge
	measures,}
\vspsml \bullit{cooperatively removing those edge correspondences formed by the
	matching which violate the connectivity structure of the two images, and}
\vspsml \bullit{matching intensity values across images in the more tightly
	constrained areas defined by the above matchings.}}

The system incorporates both an edge-based and an intensity-based analysis
in its processing. Edges are matched in a near-optimal line-by-line
dynamic programming correlation.  This correlation allows edges to remain
unpaired (from which they may be considered either occluded or spurious).
Edges in reduced resolution versions of the images are first correlated to
provide rough estimates of the scene correspondences.  These then
constrain the disparity values possible for edge matchings in the full
resolution images. The dynamic programming algorithm maximizes the
decision metric, giving the best (local to the line) interpretation of the
edges as part of a realizable scene.  Two-dimensional connectivity of the
edges is used to remove correspondences inconsistent with a single global
interpretation.  Remaining matches form a kernel of good correspondences
from which the further edge-based and then intensity-based dynamic
programming correlations complete the image matchings up to the point of a
full disparity map.

The strategy here was to first pair edges that are significant in some
sense (have a strong low frequency component), use these results to limit
the search for matching smaller scale detail (higher frequency), and then
fill in all possible surface detail by matching actual image intensity
values at this yet further constrained level.  The result of the
processing is a full image array disparity map of the scene. Dynamic
programming, in the form of the Viterbi algorithm, is used for all (there
are four) correspondence processes.
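
The shape of such a line-by-line correspondence can be sketched as a small
dynamic program that preserves edge ordering and lets either side leave an
edge unpaired (the cost scale and names are hypothetical; Baker's actual
metric and Viterbi formulation differ in detail):

```python
def match_edges(left, right, score, skip_cost=1.0):
    """Align two ordered sequences of edge positions along an epipolar
    line.  Ordering is preserved (a standard monotonicity assumption),
    and either side may leave an edge unmatched at a fixed penalty;
    unmatched edges can be read as occluded or spurious.
    `score(a, b)` is the reward for pairing edges a and b.
    Returns (total score, list of (i, j) index pairs)."""
    n, m = len(left), len(right)
    # best[i][j]: best score using left[:i] and right[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0:
                cands.append((best[i - 1][j] - skip_cost, 'L'))  # left[i-1] unmatched
            if j > 0:
                cands.append((best[i][j - 1] - skip_cost, 'R'))  # right[j-1] unmatched
            if i > 0 and j > 0:
                cands.append((best[i - 1][j - 1] + score(left[i - 1], right[j - 1]), 'M'))
            best[i][j], back[i][j] = max(cands)
    pairs, i, j = [], n, m
    while i or j:
        move = back[i][j]
        if move == 'M':
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif move == 'L':
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]

# Reward pairings with small disparity; the extra right edge at 25 is
# left unpaired (occluded or spurious).
total, pairs = match_edges([10, 20, 30], [12, 19, 25, 31],
                           score=lambda a, b: 2.0 - abs(a - b) * 0.1)
```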

In summary, Baker's system uses collinear epipolar geometry, resolution
reduction, and four versions of Viterbi dynamic programming optimization
in its search for line-by-line image matches.  Global consistency is
obtained through a cooperative process which uses projective connectivity
to infer 3-space continuity.



Arnold System 1982
------------------

Arnold describes a feature-based stereo correspondence system.  The use of
epipolar geometry makes the matching a one-dimensional problem. Edge
continuity is utilized in combining matches along adjacent epipolar lines
to produce a match over an entire image.

An important component of the thesis is the derivation of two specific
analytic functions described earlier.  These functions lead to the
quantification of two matching constraints. The author has modified a
dynamic programming algorithm to incorporate these matching constraints
with an interpretation mechanism that provides an explanation for each
edge along an epipolar line.

A globally optimal match may be sub-optimal when viewed from the narrow
context of a single epipolar line.  Edge continuity across epipolar lines
is used to take the one-dimensional matchings into a globally consistent
two-dimensional matching.  This is achieved through an extension of the
Viterbi algorithm which produces for each line matching a list of all
matches scoring within a preselected range of the optimal match. This list
is filtered by an iterative process which enforces consistency among
adjacent epipolar lines.

The system assumes an input of linked edges; that is, extended edges
rather than edge elements (in fact, the current implementation is based on
straight line segments).  Input files were prepared by hand through
computer-assisted tracing of real imagery.  Effort was made to reflect the
imperfection of real data. Constraints based on image intensities were used
only very weakly, although future work with this system will correct this.

\ignore{When they are as specialized as these, constraints should be
	applied only when justified by conclusive evidence in the scene.
	More usefully, when the view may be not of a restricted domain,
	the constraints should be derived from observations on {\it general}
	cases and function to distinguish realizable from impossible or
	unlikely interpretations. CDC}

\ignore{Noticeable is the fact that the correlation performed on
	intensity values needs improvement; although its disparity context is
	quite well defined by the `kernel' of good edge correspondences, the pixel
	intensity correspondence is subject to line-by-line aberrations.
	Smoothness in disparities along the line is controlled by the
	optimization, but no such constraints exist for {\it between} line disparities
	-- a better, two-dimensional fitting algorithm is needed.  Results show
	the processing of two quite disparate scenes, but demonstration on more
	imagery is necessary to convince one of the algorithm's generality. Baker}